Abstract: This paper presents a method of computing
incremental difference features on the basis of scan line
projection and scan converting lines for the lipreading problem
on a set of isolated word utterances. These features are affine
invariant and are found to be effective in identifying
similarity between utterances by the speaker in the spatial domain.
I. Introduction

Automatic speech recognition (ASR) systems have been designed
for well-defined applications such as dictation and medium
vocabulary transaction processing tasks in relatively controlled
environments. Researchers have observed that ASR performance
is far from human performance across a variety of tasks and
conditions; indeed, ASR to date is very sensitive to variations
in the environmental channel (non-stationary noise sources such
as speech babble, reverberation in closed spaces such as cars,
multi-speaker environments) and in the style of speech (such as
whispering) [1].

Lipreading draws on both auditory and imagery information as
sources of speech information. It provides redundancy with the
acoustic speech signal but is less variable than acoustic
signals; the acoustic signal depends on lip, teeth, and tongue
position to the extent that significant phonetic information is
obtainable using lip movement recognition alone [2] [3]. The
intimate relation between the audio and imagery sensor domains
in human recognition can be demonstrated with the McGurk effect
[4] [5], where the perceiver "hears" something other than what
was said acoustically due to the influence of a conflicting
visual stimulus. Current speech recognition technology may
perform adequately in the absence of acoustic noise for moderate
size vocabularies, but even in the presence of moderate noise it
fails except for very small vocabularies [6] [7] [8] [9]. Humans
also have difficulty distinguishing between some consonants
when the acoustic signal is degraded.

However, to date all automatic speechreading studies have been
limited to very small vocabulary tasks and, in most cases, to a
very small number of speakers. In addition, the diverse
algorithms suggested in the literature for automatic
speechreading are very difficult to compare, as they are hardly
ever tested on any common audio-visual database. Furthermore,
most such databases are of very short duration, which places
doubt on the generalization of reported results to large
populations and tasks. There is no specific answer to this, but
researchers are concentrating more on speaker independent
audio-visual large vocabulary continuous speech recognition
systems [10].

©2011 ACEEE
DOI: 01.IJIT.01.01.66

Many methods have been proposed by researchers to enhance
speech recognition systems by synchronizing visual information
with speech. An early improvement was an automatic lipreading
system that incorporated dynamic time warping and vector
quantization applied to alphabets and digits; recognition was
restricted to isolated utterances and was speaker dependent [2].
Later, Christoph Bregler (1993) worked on how recognition
performance in automated speech perception can be significantly
improved and introduced an extension to the existing Multi-State
Time Delay Neural Network architecture for handling both
modalities, that is, acoustic and visual sensor input [11].
Similar work was done by Yuhas et al. (1993), who focused on
neural networks for vowel recognition and worked on static
images [12].

Paul Duchnowski et al. (1995) worked on movement invariant
automatic lipreading and speech recognition [13]. Juergen
Luettin (1996) used an active shape model and a hidden Markov
model for visual speech recognition [14]. K. L. Sum et al.
(2001) proposed a new optimization procedure for extracting the
point-based lip contour using an active shape model [16].
Capiler (2001) used an active shape model and Kalman filtering
in the spatiotemporal domain for noting visual deformations
[17]. Ian Matthews et al. (2002) proposed a method for
extracting visual features of lipreading for audio-visual
speech recognition [18]. Xiaopeng Hong et al. (2006) used a PCA
based DCT feature extraction method for lipreading [19].
Takeshi Saitoh et al. (2008) analyzed an efficient lipreading
method for various languages, focusing on a limited set of
words from English, Japanese, Nepalese, Chinese, and Mongolian;
the English words and their translations in the listed
languages were considered for the experiment [20]. Meng Li et
al. (2008) proposed a novel motion based lip feature extraction
for lipreading problems [21].

The paper is organized in four sections. Section I covers the
introduction and literature review. Section II describes the
methodology adopted. Section III discusses the results obtained
by applying the methodology, and Section IV concludes the
paper.
II. Methodology

The system takes its input in the form of video (moving
pictures), which comprises visual and audio data as shown in
Figure 1. This acts as the input to audio-visual speech
recognition. Samples were collected from subjects whose mother
tongue is a Devanagari language. The speakers pronounced
isolated words consisting of city names.
Figure 1: Proposed Model

Samples from a female subject were chosen. Each speaker or
subject was requested to begin and end each utterance of an
isolated city name with the mouth in the closed-open-closed
position. No head movement was allowed; speakers were provided
with a close-up view of their mouth and urged not to move the
face out of the frame. The dataset was prepared under these
constraints. The video input was acquired in the acquisition
phase and passed to the sampler, which samples the video into
frames at the standard rate of 32 frames per second. Normally a
video input of 2 seconds was recorded for each subject; from
these samples the sampler produced 64 images per utterance
(32 frames per second × 2 seconds). These were treated as an
image vector T of 64 images, shown in Figure 2.

The image vector T has to be enhanced because the images in T
depend on lighting conditions, head position, and so on.
Registration or realignment of image vector T was not
necessary, since all samples were collected from the subject in
the constrained environment discussed above.
Figure 2: Subject with utterance of word "MUMBAI", Time 0.02 Sec @ 32 fps

Image vector T was converted from color to gray and then to
binary, using histogram equalization, background estimation,
and image morphological operations (open, close, adjust) with a
defined structuring element. The outcome of this preprocessing
is shown in Figure 3.
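
The gray-to-binary and morphological steps can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the luminance weights, the fixed threshold of 128, the disk radius, and all function names are assumptions made for illustration, and histogram equalization and background estimation are omitted.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to gray levels (assumed BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def to_binary(gray, threshold=128):
    """Threshold a gray image; the value 128 is an illustrative assumption."""
    return [[1 if v >= threshold else 0 for v in row] for row in gray]


def disk(radius):
    """Offsets (dy, dx) of a disk-shaped structuring element."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def _pixel(img, y, x):
    """Read a pixel, treating everything outside the image as background."""
    return img[y][x] if 0 <= y < len(img) and 0 <= x < len(img[0]) else 0


def erode(img, offsets):
    """Keep a pixel only if the whole structuring element fits in the foreground."""
    return [[1 if all(_pixel(img, y + dy, x + dx) for dy, dx in offsets) else 0
             for x in range(len(img[0]))] for y in range(len(img))]


def dilate(img, offsets):
    """Set a pixel if the structuring element touches any foreground pixel."""
    return [[1 if any(_pixel(img, y + dy, x + dx) for dy, dx in offsets) else 0
             for x in range(len(img[0]))] for y in range(len(img))]


def opening(img, offsets):
    """Morphological opening: erosion followed by dilation."""
    return dilate(erode(img, offsets), offsets)


# Toy binary frame: a 3x3 blob plus an isolated noise pixel at row 0, column 5.
frame = [
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
opened = opening(frame, disk(1))
```

On this toy frame the opening removes the isolated pixel while keeping the core of the blob, which is the kind of clean-up a disk structuring element provides before the projections are computed.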
Figure 3: Subject with utterance of word "MUMBAI", Time 0.02 Sec @ 32 fps, Gray to Binary image conversion using Morphological Operation with Structure Element 'Disk'

A. Region of Interest

To identify the Region of Interest (ROI) in the binary image,
the scan line projections of rows, R(x), and of columns, C(y),
were computed as vectors for every frame. Each image in the
vector is represented by a two-dimensional light intensity
function F(x, y) that returns the amplitude at coordinate
(x, y):



\[ R(x) = \sum_{y} F(x, y) \tag{1} \]

\[ C(y) = \sum_{x} F(x, y) \tag{2} \]

This process suggests the areas for segmenting the eyes, nose,
and mouth in every image of the vector. It was found to be
helpful in classifying the open-close-open mouth states of the
subject, and geometrical features such as the height and width
of the mouth in every frame can easily be computed.
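
As a rough illustration of the scan line projections, the sketch below sums a binary image along rows and columns and reads the mouth height and width off the resulting profiles. The function names, the toy face, and the rule that the lowest non-empty band of rows is the mouth are assumptions for this example only.

```python
def row_projection(binary):
    """Row scan line projection: one sum per image row."""
    return [sum(row) for row in binary]


def column_projection(binary):
    """Column scan line projection: one sum per image column."""
    return [sum(col) for col in zip(*binary)]


def bands(profile, min_value=1):
    """Inclusive (start, end) index ranges where the profile is >= min_value."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value >= min_value and start is None:
            start = i
        elif value < min_value and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return runs


# Toy binary face: eyes on row 1, nose on row 3, mouth on rows 5 and 6.
face = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
]

row_bands = bands(row_projection(face))
mouth_top, mouth_bottom = row_bands[-1]          # assumption: lowest band is the mouth
mouth_rows = face[mouth_top:mouth_bottom + 1]
col_bands = bands(column_projection(mouth_rows))
mouth_left, mouth_right = col_bands[0][0], col_bands[-1][1]

mouth_height = mouth_bottom - mouth_top + 1
mouth_width = mouth_right - mouth_left + 1
```

Each band in the row profile corresponds to one facial component, so the same profile that locates the mouth also gives its height directly.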
Figure 4: (a) Horizontal (row) and vertical (column) scan line projections of face to locate facial components like eyes, eyebrows, nostrils, nose, mouth; (b) Isolation of mouth region

Masking was done to reduce the workspace. When R(x) and C(y)
were plotted, the plots represented face components such as the
eyes, nose, and mouth. The mask containing the mouth region was
framed in accordance with the very first image in vector T;
this was accomplished by computing the horizontal scan line
projections (row projections) and vertical scan line
projections (column projections) as discussed above. On source
vector T, it was observed that face components such as the
eyes, eyebrows, nose, nostrils, and mouth could easily be
isolated. The region of interest, the mouth, can easily be
located as shown in Figure 4 (a), and the face is easily
segmented into three parts (eyes, nose, mouth) as shown in
Figure 4 (b). The mask remained constant for all remaining
images of the vector, and the window coordinates containing the
mouth were fixed for the mask. This mask was applied to all
frames of the image vector so that the mouth frame was
extracted from each source image. The windowing operation
resulted in a vector called 'W', as shown in Figure 5.



Figure 5: Masking result of Subject with word "MUMBAI", Time 0.02 
Sec @ 32Fps 
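
The fixed-window masking can be sketched as follows. The window tuple layout (top, bottom, left, right, all inclusive), the function names, and the toy frames are assumptions for illustration; in the paper the window is derived from the projections of the first frame.

```python
def crop(frame, window):
    """Cut the inclusive window (top, bottom, left, right) out of one frame."""
    top, bottom, left, right = window
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


def apply_mask(frames, window):
    """Apply the same fixed window to every frame, giving the vector W."""
    return [crop(frame, window) for frame in frames]


# Two toy 4x4 frames; the mouth is assumed to occupy rows 2-3, columns 1-2.
frames = [
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
]
window = (2, 3, 1, 2)   # fixed once, from the first frame
W = apply_mask(frames, window)
```

Because the subject does not move the head, one window computed on the first frame is enough, and every later frame is cropped with the same coordinates.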
Figure 6: HRL and VRL for identification of Incremental Difference

There are two approaches to extracting lip features from the
frame vector. The first is low level analysis of the image
sequence data, which does not attempt to incorporate much prior
knowledge. The other is a high level approach that imposes a
model on the data using prior knowledge. Typically, high level
analysis uses lip tracking to extract lip shape information
alone. Here feature extraction was carried out by low level
analysis, which processes image pixels directly and is
implicitly able to retrieve additional features that may be
difficult to track, such as the teeth and tongue [22].

Low level analysis is adopted in order to compute the
features. The Horizontal Reference Line (HRL) and Vertical
Reference Line (VRL) for the lip are plotted. The points for
HRL and VRL are chosen from the scan line projection vectors,
that is, R(x) and C(y). The initial values for $P_1$, the
midpoint of HRL, were calculated as $x_p = (x_2 - x_1)/2$ and
$y_p = (y_2 - y_1)/2$, where $x_2, x_1$ are the coordinates
obtained from R(x) and $y_2, y_1$ are obtained from C(y). The
initial values for $P_2$, the midpoint of VRL, are calculated
in the same way. To obtain the exact midpoint of HRL with
reference to $P_1$ at $(x_p, y_p)$ and of VRL with reference to
$P_2$ at $(x_p, y_p)$, the HRL and VRL are represented by the
implicit line function with coefficients a, b and c:

\[ F(x, y) = ax + by + c = 0 \]

(the coefficient b of y is unrelated to the y intercept B in
the slope intercept form). If $dy = y_2 - y_1$ and
$dx = x_2 - x_1$, the slope intercept form can be written as

\[ y = \frac{dy}{dx} x + B \]

Therefore

\[ F(x, y) = dy \cdot x - dx \cdot y + B \cdot dx = 0 \]

Here $a = dy$, $b = -dx$ and $c = B \cdot dx$ in the implicit
form. For the midpoint criterion of HRL to function properly, a
must be positive; this criterion is met if dy is positive,
since $y_2 > y_1$.

To evaluate the midpoint criterion for HRL and VRL at the pixel
points $P_1$ and $P_2$, we need to compute $F_{HRL}(M)$ and
$F_{VRL}(M)$ as

\[ F_{HRL}(M) = F(x_p + 1, y_p + \tfrac{1}{2}) \tag{3} \]

\[ F_{VRL}(M) = F(x_p + 1, y_p + \tfrac{1}{2}) \tag{4} \]

The decision is based on the value of the function at
$(x_p + 1, y_p + \tfrac{1}{2})$, so it is necessary to define a
decision variable d for HRL and VRL respectively:

\[ d = F(x_p + 1, y_p + \tfrac{1}{2}) \tag{5} \]

Therefore, by definition,

\[ d = a(x_p + 1) + b(y_p + \tfrac{1}{2}) + c \tag{6} \]

Conditions
If d > 0 then we choose pixel NE (North East)
If d < 0 then we choose pixel E (East)
If d = 0 then we can choose either; E is recommended

The next location of M depends on whether we chose E or NE. If
E is chosen, M is incremented by one step in the x direction;
then

\[ d_{new} = F(x_p + 2, y_p + \tfrac{1}{2}) = a(x_p + 2) + b(y_p + \tfrac{1}{2}) + c \tag{7} \]

But

\[ d_{old} = a(x_p + 1) + b(y_p + \tfrac{1}{2}) + c \tag{8} \]

If NE is chosen, M is incremented by one step each in both the
x and y directions; then

\[ d_{new} = F(x_p + 2, y_p + \tfrac{3}{2}) = a(x_p + 2) + b(y_p + \tfrac{3}{2}) + c \tag{9} \]
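
Subtracting (8) from (7) gives $d_{new} = d_{old} + a$ for an E step, and subtracting (8) from (9) gives $d_{new} = d_{old} + a + b$ for an NE step, so the decision variable can be updated incrementally. The sketch below is the standard midpoint line scan conversion built on these updates, shown only as an illustration of the decision rule. It assumes a line with $0 \le dy \le dx$ (a steeper line such as the VRL would need the roles of x and y swapped), it doubles d so that all arithmetic stays in integers, and the function name is hypothetical.

```python
def midpoint_line(x1, y1, x2, y2):
    """Scan convert the line from (x1, y1) to (x2, y2), assuming 0 <= dy <= dx."""
    dx, dy = x2 - x1, y2 - y1
    if not 0 <= dy <= dx:
        raise ValueError("this sketch only handles slopes between 0 and 1")
    a, b = dy, -dx                 # coefficients of F(x, y) = a*x + b*y + c
    d = 2 * a + b                  # 2 * F(x1 + 1, y1 + 1/2), since F(x1, y1) = 0
    x, y = x1, y1
    pixels = [(x, y)]
    while x < x2:
        if d > 0:                  # midpoint lies below the line: choose NE
            y += 1
            d += 2 * (a + b)
        else:                      # d < 0, or d == 0 where E is recommended
            d += 2 * a
        x += 1
        pixels.append((x, y))
    return pixels


line = midpoint_line(0, 0, 5, 2)
middle = line[len(line) // 2]      # a pixel near the middle of the scanned line
```

The sign of d is the only thing tested at each step, which is why doubling it to avoid the fraction 1/2 does not change any decision.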

By equations (3) and (4) [23], with the support of the decision
variable, new coordinates $P_1(x, y)$ for HRL and $P_2(x, y)$
for VRL are computed. Pixel $P_1$ lies at the middle of HRL and
pixel $P_2$ lies at the middle of VRL. The difference between
$P_1$ and $P_2$ is taken as the incremental difference feature
and is a unique feature for the frame. This feature is
invariant to rotation and scaling. The difference is computed
for all frames of a word utterance and stored in a vector,
referred to as the feature vector for the word. Feature vectors
are collected for all samples of the words {AURANGABAD, MUMBAI,
PARBHANI, KOLHAPUR, OSMANABAD}.
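
A minimal sketch of assembling the per-utterance feature vector follows. The text does not state how the difference between the two landmark pixels is measured, so the Euclidean pixel distance used here is an assumption, as are the function names and the toy landmark coordinates.

```python
import math


def incremental_difference(p1, p2):
    """Difference between the two landmark pixels of one frame.

    Assumption: measured as the Euclidean distance between the pixels.
    """
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])


def feature_vector(landmarks):
    """One feature per frame, from a list of (P1, P2) landmark pairs."""
    return [incremental_difference(p1, p2) for p1, p2 in landmarks]


# Toy landmarks for three frames (a real utterance has 64 frames).
landmarks = [((0, 0), (3, 4)), ((1, 1), (1, 1)), ((2, 0), (2, 6))]
features = feature_vector(landmarks)
```

With 64 frames per utterance, every word sample yields a feature vector of the same length, which makes the later comparison by Euclidean distance straightforward.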

III. Results and Discussion

The midpoint (M) for HRL and VRL was chosen on the basis of the
method discussed above. The pixels $P_1$ and $P_2$, with their
new coordinates $(x_p, y_p)$, are marked as landmark points on
every frame of the vector, as shown in Figure 7.




TABLE II 
EUCLIDEAN DISTANCE MATRIX FOR WORD AURANGABAD 



Figure 7 : Marking of all landmark points 

The pixel difference between $P_1$ and $P_2$ was recorded as
the feature of the frame; the same difference was computed for
all frames of the vector and stored in the feature vector. The
feature vectors corresponding to all utterances of the word
'AURANGABAD' were formed, and their mean feature vector was
calculated. The Euclidean distance between the mean feature
vector and each computed feature vector is given in Table I.
From Table I, it is observed that sample 1 and sample 2 of the
word 'AURANGABAD' are similar, sample 4 and sample 5 are also
similar, and sample 8 and sample 9 are the same. Similar
results were obtained for the other samples of the words
uttered by the speaker. Table II gives the Euclidean distance
matrix for the word 'AURANGABAD' uttered by speaker 1. Graph I
shows the similarity between the maxima and minima of the
feature vectors of sample 1 and sample 2 of the word
'AURANGABAD' uttered by speaker 1, and Graph II shows how two
words, 'AURANGABAD' and 'KOLHAPUR', uttered by the same speaker
differ on the basis of the computed feature vectors and the
maxima and minima observed in Graph II. The mean feature
vectors of these words are plotted. These feature vectors are
formed with the incremental difference procedure.
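
The comparison described above can be sketched in a few lines. The function names and the toy feature vectors are assumptions for illustration and are not the paper's data; all vectors are assumed to have the same length.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distances_to_mean(vectors):
    """Distance of each sample from the mean feature vector."""
    mean = mean_vector(vectors)
    return [euclidean(v, mean) for v in vectors]


def distance_matrix(vectors):
    """Pairwise distances between samples; symmetric with a zero diagonal."""
    return [[euclidean(u, v) for v in vectors] for u in vectors]


# Toy feature vectors for two samples of one word.
samples = [[0.0, 0.0], [6.0, 8.0]]
to_mean = distances_to_mean(samples)
matrix = distance_matrix(samples)
```

Small entries in the pairwise matrix indicate utterances whose incremental difference profiles are close, which is how similar samples of the same word are identified.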

TABLE I
EUCLIDEAN DISTANCE BETWEEN MEAN FEATURE OF EACH WORD AND FEATURE VECTOR FOR THE WORD UTTERED BY SPEAKER 1

Columns: Sample, Aurangabad, Mumbai, Kolhapur, Parbhani
Surviving entries, in source order: 25.74, 16.34, 22.49, 23.22, 61.12, 24.02, 43.89, 34.58, 19.26, 47.39, 45.16



TABLE II
EUCLIDEAN DISTANCE MATRIX FOR WORD AURANGABAD

Word      1       2       3       4       5
1         -     25.74   25.86   43.89   31.87
2       25.74     -     31.17   33.61   30.18
3       25.86   31.17     -     27.67   27.67
4       43.89   33.61   27.67     -     47.39
5       31.87   30.18   27.67   47.39     -



GRAPH I. SIMILARITY BETWEEN THE UTTERANCES OF WORD 'AURANGABAD' BY THE SPEAKER 1
GRAPH II. DIFFERENCE BETWEEN UTTERANCES OF WORD 'AURANGABAD' AND 'KOLHAPUR' BY SPEAKER 1
Similar results are observed for the other samples of the
speakers.

IV. Conclusion

The incremental difference is a novel method of feature
extraction for audio-visual speech recognition, and it is found
to be suitable for the enhancement of speech recognition. This
method helps in differentiating the words spoken by the
speaker.